Below we'll be performing analysis on football players dataset provided to us. We'll be performing list of steps mentioned below in our analysis:
We'll start by importing default libraries for our purpose which we'll be using for analysis below.
import pandas as pd ## For maintaining dataframes.
import numpy as np ## For arrays
import matplotlib.pyplot as plt ## Data Visualization Library.
import plotly.graph_objects as go ## Data Visualization Library.
import plotly.express as px ## Data Visualization Library.
import hvplot.pandas ## Data Visualization Library.
import holoviews as hv ## Data Visualization Library.
import seaborn as sns ## Data Visualization Library.
import warnings ## Supress unwanted warnings
warnings.filterwarnings('ignore') ## Supress unwanted warnings
## Below command plots matplotlib plots into notebook
%matplotlib inline
We'll start by loading CSV file and printing it's shape for information purpose.
df = pd.read_csv('CompleteDataset.csv')
print('dataset Size : ', df.shape)## Print (rows, columns)
df.head() ## Prints first 5 rows
Below We are cleaning data of columns which has special characters. We'll be remvoing special characters from Name, Club,Wage and Value columns. We also remove rows which has any NA values.
df = df.dropna() ## Drop rows which has NA values
## Replace special chracts by using "ascii" encoding. It loops through all strings and ignore chracters for which it can't find ascii mappings.
## This way all special chracters will be handled.
df['Name'] = [name.encode().decode('ascii', 'ignore') if isinstance(name,str) else name for name in df.Name]
df['Club'] = [club.encode().decode('ascii', 'ignore') if isinstance(club, str) else club for club in df.Club]
## We'll replace special character euro from Wage column. Then we'll remove K and multiply values by 1000.
df['Wage'] = df['Wage'].str.replace('€', '')
df['Wage'] = df['Wage'].str.replace('K', '')
df['Wage'] = df['Wage'].astype(np.int32) * 1000
## We'll remove special chracter euro from Value column.
df['Value'] = df['Value'].str.replace('€', '')
## Below method takes as input value and if it has K then multiplies by 1000 and if it has M then multiplies by 1000000.
def convert_values(val):
if 'K' in val:
new_val = val.replace('K','')
new_val = float(new_val) * 1000
return new_val
elif 'M' in val:
new_val = val.replace('M','')
new_val = float(new_val) * 1000000
return new_val
else:
return float(val)
## We'll replace all Values with appropriate conversion
df['Value'] = np.array([convert_values(val) for val in df['Value']], dtype=np.float32)
## Removing unwanted whitespaces from strings
df['Club'] = df['Club'].str.strip()
df['Nationality'] = df['Nationality'].str.strip()
df.head()
Below we are selecting columns which are fit for analysis purpose. We are not taking column Unnamed, Photo, Flag and club flag as it won't be much helpful for analysis.
## We have taken ouc columns names Unnamed, Photo, Flat, Club Flag and ID as hey seems unusable columns for analysis.
useful_columns = ['Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Value', 'Wage', 'Special',
'Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control', 'Composure', 'Crossing',
'Curve', 'Dribbling', 'Finishing', 'Free kick accuracy', 'GK diving', 'GK handling',
'GK kicking', 'GK positioning', 'GK reflexes', 'Heading accuracy', 'Interceptions', 'Jumping', 'Long passing',
'Long shots', 'Marking', 'Penalties', 'Positioning', 'Reactions', 'Short passing', 'Shot power',
'Sliding tackle', 'Sprint speed', 'Stamina', 'Standing tackle', 'Strength', 'Vision', 'Volleys']
df = df[useful_columns]
df.head()
df.dtypes ## For checking datatype of each column. object data type refers to string.
As we can see majority of columns are of object type, we'll be only considering columns with datatype as integer or float. We'll be considering columns like Nationality and Club as well for analysis which are of string type.
We'll be considering columns Nationality, Club, Age, Overall, Potential, Value, Wage and Special for analysis.
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].max()
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].min()
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].mean()
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].median()
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].mode()
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].std()
Correlation tells us how each columns are related to each other and how much relation exist between them.
correlation = df.corr()
correlation
df.describe()
df = df.sort_values(by =["Club", "Wage"])
df[['Club', 'Name', 'Wage']].head()
players_grouped_by_countries = df.sort_values("Nationality").set_index('Nationality')[['Name']]
players_grouped_by_countries.head(10)
players_grouped_by_countries.tail(10)
average_players_potential_by_age = df.groupby(by='Age').mean()[['Potential']]
average_players_potential_by_age
average_players_potential_by_age.hvplot(kind='bar', title='Average Potential Of Player By Age')
List of Findings:
Below we are using heatmap to show correlation between age, overall, potential, value, wage and special which we calculated above. We are also showing colorbar which show intensity of relation. Correlation tells us how much particular column is related to another column.
with plt.style.context(('seaborn', 'ggplot')): ## It gives look of seaborn and ggplot to matplotlib.
fig = plt.figure(figsize=(10,10)) ## Declaring figure
plt.matshow(correlation, cmap=plt.cm.Blues, fignum=1) ## Matshow for plotting heatmapt with blues colormap.
plt.xticks(range(len(correlation.index)), correlation.index) # Naming xticks
plt.yticks(range(len(correlation.index)), correlation.index) # Naming yticks
plt.title("Heat Map displaying correlation")
plt.colorbar()
plt.grid(None)
## Below code loops through all values of correlation and adds them in as red text in chart
for i in range(len(correlation.index)):
for j in range(len(correlation.index)):
plt.text(j,i, "%.2f"%correlation.values[i,j],
horizontalalignment='center',
verticalalignment='center',
color='red',
fontsize=12)
List of Findings:
Below we are plotting count of players per country by considering only countries which has more than 100 players to avoid graph from getting very crowded.
%%opts Bars [height=500 width=900 xrotation=90 title="Number Of Players Per Country(>100 Players)"]
## Groupby players per nationlity and count them and then take Name column and rename it to count
number_of_players_per_country = df.groupby(by="Nationality").count()[['Name']].rename(columns={'Name':'Count'})
## Sort players by Count
number_of_players_per_country = number_of_players_per_country.sort_values(by="Count")
## Consider only rows where number of players are more than 100
number_of_players_per_country = number_of_players_per_country[number_of_players_per_country.Count > 100]
## Plot bar chart showing count
number_of_players_per_country.hvplot(kind="bar", color="tomato")
List of Findings:
Below we are plotting bar chart showing average wage per club using holviews library. We are only considering entries where average wage is above 50k euro otherwise chart will become crowded to understand if we include all clubs.
%%opts Bars [height=500 width=900 xrotation=90 title="Average Wage Per Club(>50,000 Euro)"]
## Group rows by club and take mean of Wage. Rename Wage column to Average_wage
avg_wage_by_club = df.groupby(by="Club").mean()[['Wage']].rename(columns={'Wage':'Average_Wage'})
## Sort rows by average_wage
avg_wage_by_club = avg_wage_by_club.sort_values(by="Average_Wage")
## Take only rows where average wage is greater than 50k.
avg_wage_by_club = avg_wage_by_club[avg_wage_by_club.Average_Wage > 50000]
avg_wage_by_club.hvplot(kind="bar", color="lawngreen")
List Of Findings:
Below we are plotting scatter plot showing relation between age and wage. We also have color-encoded markers according to potential.
plt.figure(figsize=(15,8))
sns.scatterplot(data=df, x="Age", y="Wage" , size="Potential", hue="Potential")
plt.title("Scatter plot explaining relationship between Age and Wage color-encoded by Potential");
List of Findings:
Below we are using pandas internal scatter matrix function which plots scatter matrix using matplotlib library. Scatter matrix is plotted for rows Age, Overall, Potential, Value, Wage and Special. All diagonal plots of scatter matrix are histograms showing distribution of that column and all non-diagonal charts are scatter plot showing relation between that 2 columns intersecting there.
pd.plotting.scatter_matrix(df, figsize=(18,18), diagonal='hist', color="tomato");
List of Findings:
Below we are plotting bar chart showing average (Overall,Potential) per club for top 10 clubs according to most value of Overall and Potential.
%%opts Bars [height=500 width=1100 xrotation=90 title="Top 10 Clubs By Average Potential & Overall"]
## Group rows by club and then take mean of Potential and Overall.
avg_potential_overall_by_club = df.groupby(by="Club").mean()[['Potential', 'Overall']]
## Sort values by first Potential and then Overall.
avg_potential_overall_by_club = avg_potential_overall_by_club.sort_values(by=["Potential","Overall"], ascending=False)
## Plot them as bar chart.
avg_potential_overall_by_club.head(10).hvplot(y=['Potential', 'Overall'], kind="bar")
List of Findings:
Below we have used pie chart to display distribution of players from all countries. These countries has most players in dataset.
players_count_per_country = df.groupby("Nationality").count()[['Name']].rename(columns={"Name":"Count"})
players_count_per_country = players_count_per_country.sort_values(by="Count", ascending=False)#.head(10)
fig = go.Figure(data=[go.Pie(labels=players_count_per_country.index, values=players_count_per_country.Count)])
fig.update_layout(height=800,width=800, title="Countries with number of players")
fig.show()
List of Findings: